skip to main content


Search for: All records

Creators/Authors contains: "Cleary, Alan"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Recent advancements in DNA sequencing and assembly have drastically lowered cost and improved quality. This has allowed for collections of genomes to be created that better reflect the variability within a single species. These pangenomes continue to grow in size and scope as new sequences are added, yet such collections have already proven to be challenging to handle without significant computational infrastructure, with the primary challenge being the large data size. Unfortunately, existing compression algorithms do not allow analysis to be performed directly on the compressed data. Furthermore, many common compression paradigms do not take advantage of the high similarity between genomes from the same species, resulting in compression that scales relative to data size rather than relative to information content. In this work, we present and propose novel grammar-based compression algorithms designed specifically for pangenome analysis. By leveraging maximal repeats, these algorithms have the potential to enable pangenome analysis at unprecedented scales. 
    more » « less
    Free, publicly-accessible full text available June 5, 2024
  2. It is known that a context-free grammar (CFG) that produces a single string can be derived from the compact directed acyclic word graph (CDAWG) for the same string. In this work, we show that the CFG derived from a CDAWG is deeply connected to the maximal repeat content of the string it produces and thus has O(m) rules, where m is the number of maximal repeats in the string. We then provide a generic algorithm based on this insight for constructing the CFG from the LCP-intervals of a string in O(n) time, where n is the length of the string. This includes a novel data-structure to support stabbing queries on LCPintervals in O(1+k) time after O(n) preprocessing time, where k is the number of intervals stabbed. These results connect the CFG to properties of the string it produces and relates it to other string data-structures, allowing it to be studied independently of the CDAWG and providing opportunity for innovation of grammar-based compression algorithms. 
    more » « less
  3. Abstract Summary

    The Genome Context Viewer is a visual data-mining tool that allows users to search across multiple providers of genome data for regions with similarly annotated content that may be aligned and visualized at the level of their shared functional elements. By handling ordered sequences of gene family memberships as a unit of search and comparison, the user interface enables quick and intuitive assessment of the degree of gene content divergence and the presence of various types of structural events within syntenic contexts. Insights into functionally significant differences seen at this level of abstraction can then serve to direct the user to more detailed explorations of the underlying data in other interconnected, provider-specific tools.

    Availability and implementation

    GCV is provided under the GNU General Public License version 3 (GPL-3.0). Source code is available at https://github.com/legumeinfo/lis_context_viewer.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  4. Abstract

    Legumes, comprising one of the largest, most diverse, and most economically important plant families, are the subject of vibrant research and development worldwide. Continued improvement of legume crops will benefit from the recent proliferation of genetic (including genomic) resources; but the diversity, scale, and complexity of these resources presents challenges to those managing and using them. A workshop held in March of 2019 addressed questions of data resources and priorities for the legumes. The workshop identified various needs and recommendations: (a) Develop strategies to effectively store, integrate, and relate genetic resources collected in different projects. (b) Leverage information collected across many legume species by standardizing data formats and ontologies, improving the state of metadata about datasets, and increasing use of the FAIR data principles. (c) Advocate for the critical role that curators exercise in integrating complex datasets into databases and adding high value metadata that enable downstream analytics and facilitate practical applications. (d) Implement standardized software and database development practices to best leverage limited developer time and expertise gained from the various legume (and other) species. (e) Develop tools and databases that can manage genetic information for the world's plant genetic resources, enabling efficient incorporation of important traits into breeding programs. (f) Centralize information on databases, tools, and training materials and establish funding streams to support training and outreach.

     
    more » « less